PCA
Principal Component Analysis (PCA) and Principal Component Regression (PCR) seemed well suited to this dataset. The purpose of the technique is to reduce the number of variables while accounting for collinearity. The dataset contained ten variables for explaining IncomePerCap, and the correlation matrix showed notable correlation among several of the predictors: for example, Professional was notably correlated with Service, Construction, Production, and Unemployment, and White with Hispanic and Black. Based on this initial overview of the correlation matrix, PCA appeared suitable and was pursued.
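The analysis itself was carried out in R. As an illustrative sketch only, the correlate-then-decompose workflow can be reproduced in Python with NumPy on synthetic data; the variable names and numbers below are hypothetical stand-ins for the census-style predictors, not values from this dataset:

```python
import numpy as np

# Synthetic stand-ins with built-in collinearity, e.g. a "Professional"-like
# variable negatively related to "Unemployment"- and "Service"-like variables.
rng = np.random.default_rng(0)
n = 1000
professional = rng.normal(size=n)
unemployment = -0.7 * professional + rng.normal(scale=0.5, size=n)
service = -0.5 * professional + rng.normal(scale=0.8, size=n)
X = np.column_stack([professional, unemployment, service])

# Correlation matrix: large off-diagonal entries are what motivate PCA.
corr = np.corrcoef(X, rowvar=False)

# PCA on standardized data via SVD of the centered, scaled matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained_var = s ** 2 / np.sum(s ** 2)  # proportion of variance per component

print(corr[0, 1] < -0.5)       # the collinear pair is strongly correlated
print(explained_var[0] > 1/3)  # PC1 captures more than an equal share
```

With collinear inputs, the first component absorbs the shared variation, which is exactly why PCA can compress correlated predictors into fewer components.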

The biplot above covers more than 70,000 data points, which accounts for the dense scattering. Its axes are PC1 (horizontal) and PC2 (vertical). PC1 had the most variation, ranging from approximately -5 to 10, while PC2 ranged from about -8 to 10. The variables White, Production, Unemployment, Black, and Professional loaded fairly evenly across PC1 and PC2, while others, such as Office, Service, and Construction, were represented mostly in one component or the other.
Importance of components:
                            PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8     PC9     PC10
Standard deviation       1.7878 1.3389 1.1653 1.0355 0.88819 0.82267 0.76933 0.71304 0.12303 0.003342
Proportion of Variance   0.3196 0.1792 0.1358 0.1072 0.07889 0.06768 0.05919 0.05084 0.00151 0.000000
Cumulative Proportion    0.3196 0.4989 0.6347 0.7419 0.82078 0.88845 0.94764 0.99849 1.00000 1.000000
The breakdown of the variance explained by each component showed that just over 60% of the variation was accounted for by the first three components. Beyond the first component, however, the drop in variance explained from one component to the next was gradual, as the following graph further illustrates.
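The proportions and cumulative proportions in the table follow directly from the reported component standard deviations; as a quick check (Python/NumPy used here purely as a calculator on the numbers reported above):

```python
import numpy as np

# Component standard deviations as reported by summary(prcomp(...)) above.
sdev = np.array([1.7878, 1.3389, 1.1653, 1.0355, 0.88819,
                 0.82267, 0.76933, 0.71304, 0.12303, 0.003342])

# With ten standardized variables the total variance is approximately 10,
# so each proportion is sdev^2 over the sum of all sdev^2.
prop_var = sdev ** 2 / np.sum(sdev ** 2)
cum_prop = np.cumsum(prop_var)

print(cum_prop[2])  # first three components: roughly 0.635 of the variance
```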

Call:
lm(formula = IncomePerCap ~ ., data = pcadata_pcr_rot)
Residuals:
Min 1Q Median 3Q Max
-57889 -3154 -136 3093 39355
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 20.93 1250.463 < 2e-16 ***
PC1 -4585.05 11.71 -391.701 < 2e-16 ***
PC2 -1454.29 15.63 -93.043 < 2e-16 ***
PC3 604.54 17.96 33.664 < 2e-16 ***
PC4 994.55 20.21 49.214 < 2e-16 ***
PC5 -878.20 23.56 -37.274 < 2e-16 ***
PC6 1377.18 25.44 54.140 < 2e-16 ***
PC7 -205.74 27.20 -7.564 3.96e-14 ***
PC8 -196.99 29.35 -6.712 1.93e-11 ***
PC9 3301.06 170.10 19.407 < 2e-16 ***
PC10 -3519.28 6262.21 -0.562 0.574
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5519 on 69556 degrees of freedom
Multiple R-squared: 0.7102, Adjusted R-squared: 0.7101
F-statistic: 1.704e+04 on 10 and 69556 DF, p-value: < 2.2e-16
A full principal component regression was performed, and every component except the last was deemed significant. Notably, the regression explained 71.01% of the variance in IncomePerCap according to the adjusted R-squared. The strongest predictor was, unsurprisingly, PC1, whose t-value was far larger in magnitude than that of any other component.
A plot of the R-squared values against the number of components showed how much of the variation in the dependent variable, IncomePerCap, was explained as components were added. The steep initial increase followed by a leveling off indicated that a substantial share of the variation in IncomePerCap was explained by the first component alone. Based on this R-squared curve and the regression results, it seemed appropriate to run a regression on just PC1, which resulted in a lower adjusted R-squared.
Call:
lm(formula = IncomePerCap ~ PC1, data = pcadata_pcr_rot)
Residuals:
Min 1Q Median 3Q Max
-42965 -4026 -442 3546 36606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 23.34 1121.0 <2e-16 ***
PC1 -4585.05 13.06 -351.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6157 on 69565 degrees of freedom
Multiple R-squared: 0.6393, Adjusted R-squared: 0.6393
F-statistic: 1.233e+05 on 1 and 69565 DF, p-value: < 2.2e-16
The regression on PC1 alone confirmed the smaller R-squared, at 63.93%. The tradeoff between parsimony and explanatory power made the choice between the two models unclear. If the more explanatory model were chosen, accounting for the number of components via adjusted R-squared, only one component would be removed, which is not an effective reduction in the number of variables; on the other hand, bias would remain low since only one component was dropped.
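As a hedged sketch of this PCR comparison (synthetic data and NumPy least squares, not the census dataset or the R `lm` calls above): regressing on all of the PC scores reproduces the full OLS fit, while truncating to PC1 alone can only lower the training R-squared:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 5))
X[:, 1] += 0.8 * X[:, 0]  # induce collinearity between two predictors
y = 3.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# PC scores from the SVD of the standardized design matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

def r_squared(design, y):
    """Training R^2 of an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), design])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_all = r_squared(scores, y)         # all components: same fit as full OLS
r2_pc1 = r_squared(scores[:, :1], y)  # PC1 only: a nested, smaller model

print(r2_pc1 <= r2_all)  # dropping components can never raise training R^2
```

This mirrors the tradeoff above: the one-component model is more parsimonious, but its R-squared is necessarily no higher than the full PCR's.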
K-Means
List of 9
$ cluster : Named int [1:69567] 2 2 2 2 2 1 2 1 2 2 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:2, 1:11] -0.329 0.174 0.42 -0.222 -0.32 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:2] "1" "2"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:2] 1.15e+12 1.34e+12
$ tot.withinss: num 2.5e+12
$ betweenss : num 4.81e+12
$ size : int [1:2] 24086 45481
$ iter : int 1
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 2 clusters of sizes 24086, 45481
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3285506 0.4200874 -0.3199683 0.2372900 0.9282858 -0.6150829
2 0.1739950 -0.2224715 0.1694500 -0.1256649 -0.4916051 0.3257379
Office Construction Production Unemployment IncomePerCap
1 -0.01890396 -0.4302842 -0.6556146 -0.5268802 37598.33
2 0.01001123 0.2278715 0.3472029 0.2790272 20114.40
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 1 1 2 2 1 2 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 1.153649e+12 1.344211e+12
(between_SS / total_SS = 65.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 2 1 1 2 2 2 1 2 2 2 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:3, 1:11] 0.432 -0.24 -0.363 -0.575 0.34 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:3] "1" "2" "3"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:3] 4.33e+11 3.94e+11 3.93e+11
$ tot.withinss: num 1.22e+12
$ betweenss : num 6.09e+12
$ size : int [1:3] 27175 29638 12754
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 3 clusters of sizes 27175, 29638, 12754
Cluster means:
Hispanic White Black Asian Professional Service
1 0.4318377 -0.5751979 0.3952287 -0.15378102 -0.7689991 0.6127621
2 -0.2397814 0.3399978 -0.2076592 -0.02410908 0.1329131 -0.2111302
3 -0.3629095 0.4354829 -0.3595529 0.38368702 1.3296436 -0.8149861
Office Construction Production Unemployment IncomePerCap
1 -0.02605785 0.2974405 0.51204618 0.6194822 16576.79
2 0.07327428 0.0108065 -0.07894429 -0.3101802 27821.96
3 -0.11475466 -0.6588701 -0.90756657 -0.5991303 42759.54
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 1 1 2 2 2 1 2 2 2 2 1 1 1 2 2 2 1 3 3 2 2 2 1 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
1 1 2 2 3 2 3 1 3 3 2 2 2 2 1 1 2 1 1 1 1 1 1 1 1 1
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 433025278355 393691127581 392888818743
(between_SS / total_SS = 83.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 4 2 4 4 4 1 4 1 4 4 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:4, 1:11] -0.302 0.668 -0.374 -0.132 0.407 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "1" "2" "3" "4"
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:4] 1.76e+11 1.92e+11 1.73e+11 1.75e+11
$ tot.withinss: num 7.15e+11
$ betweenss : num 6.6e+12
$ size : int [1:4] 17574 17698 8266 26029
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 4 clusters of sizes 17574, 17698, 8266, 26029
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3016974 0.4066104 -0.28173021 0.1074823 0.5711741 -0.43512866
2 0.6680011 -0.8809881 0.57590840 -0.1651807 -0.9293827 0.83254425
3 -0.3738848 0.4389987 -0.37976189 0.4559152 1.5331982 -0.92442949
4 -0.1317654 0.1850702 -0.08076331 -0.1050413 -0.2406168 0.02128076
Office Construction Production Unemployment IncomePerCap
1 0.06629532 -0.2290380 -0.4319137 -0.46098899 32840.09
2 -0.05484069 0.3137655 0.5749681 0.89883620 14433.73
3 -0.17952039 -0.7693828 -1.0184110 -0.62970642 45780.77
4 0.04953752 0.1856318 0.2240905 -0.09992813 23412.84
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 2 4 4 4 1 4 1 4 4 4 4 4 2 4 1 4 2 1 1 4 4 1 2 4 4
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 4 1 1 3 1 1 4 1 3 4 1 1 4 4 4 4 4 2 2 2 2 2 4 4 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 4 4 2 4 4 4 2 2 2 4 4 4 2 2 4 4 4 2 4 2 4 4
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 175536566241 191571930523 172553341760 175374034150
(between_SS / total_SS = 90.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

List of 9
$ cluster : Named int [1:69567] 5 1 1 1 5 5 1 4 1 5 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:5, 1:11] -0.00846 0.85002 -0.3769 -0.33227 -0.24884 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:5] 9.11e+10 1.06e+11 9.67e+10 9.02e+10 8.75e+10
$ tot.withinss: num 4.72e+11
$ betweenss : num 6.84e+12
$ size : int [1:5] 20760 12813 6192 11579 18223
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 5 clusters of sizes 20760, 12813, 6192, 11579, 18223
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.008456987 -0.01428733 0.06737527 -0.12483068 -0.4552929 0.2049365
2 0.850017949 -1.08804866 0.68083922 -0.17734851 -1.0217088 0.9792310
3 -0.376900236 0.44391016 -0.39300833 0.48161313 1.6342530 -0.9824220
4 -0.332267736 0.42488521 -0.31698657 0.22115974 0.8637778 -0.5739916
5 -0.248840396 0.36049690 -0.22051300 -0.03726641 0.1329121 -0.2234518
Office Construction Production Unemployment IncomePerCap
1 0.02032925 0.25592838 0.38069044 0.09861782 20788.54
2 -0.07442558 0.32126962 0.59350367 1.09913874 13091.47
3 -0.21879800 -0.82200303 -1.06575585 -0.64529647 47520.79
4 0.02701923 -0.40030391 -0.64316728 -0.52575884 36236.88
5 0.08634809 0.01621363 -0.08018998 -0.33184070 27836.80
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
5 1 1 1 5 5 1 4 1 5 1 1 1 1 5 5 1 2 4 4 5 5 5 1 1 1
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
1 1 5 5 4 4 4 1 4 3 1 5 5 1 1 1 5 1 2 1 2 1 2 1 1 1
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
1 1 1 2 1 1 1 1 1 1 1 5 1 2 2 1 1 5 1 1 2 1 1
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 91107426890 106355408998 96685021931 90198216197 87531045107
(between_SS / total_SS = 93.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 6 4 4 6 6 2 4 2 6 6 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:6, 1:11] -0.349 -0.285 0.974 0.132 -0.385 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:6] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:6] 5.33e+10 5.39e+10 6.79e+10 5.20e+10 5.61e+10 ...
$ tot.withinss: num 3.36e+11
$ betweenss : num 6.98e+12
$ size : int [1:6] 8294 13096 9901 16290 4763 17223
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 6 clusters of sizes 8294, 13096, 9901, 16290, 4763, 17223
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3494717 0.4296992 -0.3360808 0.30654695 1.0897348 -0.68272359
2 -0.2854601 0.3976771 -0.2665621 0.05425283 0.4263884 -0.36762711
3 0.9739417 -1.2113147 0.7268560 -0.18905224 -1.0768116 1.07158297
4 0.1316581 -0.2288132 0.2195608 -0.13438783 -0.6075451 0.36540166
5 -0.3852568 0.4492850 -0.4005145 0.50197893 1.7117093 -1.02643109
6 -0.1925231 0.2792048 -0.1502203 -0.09190832 -0.1287054 -0.06945889
Office Construction Production Unemployment IncomePerCap
1 -0.029446162 -0.5329557 -0.7839551 -0.5661097 38948.40
2 0.089116820 -0.1457015 -0.3276215 -0.4288697 31179.25
3 -0.088194644 0.3291525 0.5983303 1.2390235 12157.33
4 0.007625546 0.2812103 0.4727567 0.2799036 18933.00
5 -0.254726418 -0.8568239 -1.1023498 -0.6538796 48913.98
6 0.060350087 0.1491981 0.1403863 -0.1974674 24809.23
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
6 4 4 6 6 2 4 2 6 6 6 4 4 4 6 2 6 3 1 1 6 6 2 4 6 6
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
4 4 2 2 1 2 1 4 1 5 6 2 2 6 4 4 6 4 3 4 3 4 3 4 4 4
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
4 4 6 4 4 4 4 4 4 4 4 6 4 4 4 4 4 6 4 4 3 4 6
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 53289943534 53860735274 67861554081 51981016101 56119303347 52399097534
(between_SS / total_SS = 95.4 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 2 5 7 7 2 2 7 3 7 7 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:7, 1:11] -0.396 -0.258 -0.313 1.054 0.237 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:7] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:7] 3.01e+10 3.39e+10 3.39e+10 5.06e+10 3.68e+10 ...
$ tot.withinss: num 2.52e+11
$ betweenss : num 7.06e+12
$ size : int [1:7] 3586 13111 9445 8258 13680 6091 15396
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 7 clusters of sizes 3586, 13111, 9445, 8258, 13680, 6091, 15396
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.3964506 0.4542494 -0.40764714 0.5282089 1.7925081 -1.0717282
2 -0.2576300 0.3733856 -0.23385201 -0.0269068 0.1755536 -0.2474022
3 -0.3125257 0.4200419 -0.30528640 0.1527982 0.6907586 -0.4893448
4 1.0539699 -1.2761352 0.73515083 -0.1940444 -1.1025060 1.1203953
5 0.2366030 -0.3958078 0.34172930 -0.1373108 -0.7010158 0.4859066
6 -0.3575666 0.4294729 -0.34914245 0.3685170 1.2778561 -0.7833195
Office Construction Production Unemployment IncomePerCap
1 -0.29189615 -0.895033912 -1.1399021 -0.65954880 50244.25
2 0.08656937 -0.003686199 -0.1156669 -0.35053069 28266.51
3 0.06459196 -0.301790762 -0.5299635 -0.49757889 34274.90
4 -0.08602009 0.329565942 0.5903558 1.33577584 11570.50
5 -0.02053170 0.292153979 0.5252348 0.41200162 17787.86
6 -0.07512151 -0.640358239 -0.8939002 -0.59387071 41486.24
[ reached getOption("max.print") -- omitted 1 row ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
2 5 7 7 2 2 7 3 7 7 7 7 5 5 7 2 7 5 3 3 2 2 2 5 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
5 5 2 2 6 3 3 7 3 6 7 2 3 7 5 5 2 5 4 5 5 5 4 5 7 5
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
5 5 7 5 5 7 7 5 5 5 5 2 5 5 5 7 5 2 5 7 4 5 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 30078367255 33918666154 33924690315 50620627295 36763050021 31979820579
[7] 34290620831
(between_SS / total_SS = 96.6 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 1 8 7 1 1 5 7 5 1 1 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:8, 1:11] -0.21 -0.398 -0.331 -0.359 -0.276 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:8] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:8] 2.38e+10 2.24e+10 2.47e+10 2.28e+10 2.44e+10 ...
$ tot.withinss: num 1.95e+11
$ betweenss : num 7.12e+12
$ size : int [1:8] 13055 3152 7742 5140 10623 6131 13192 10532
$ iter : int 2
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 8 clusters of sizes 13055, 3152, 7742, 5140, 10623, 6131, 13192, 10532
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.2099021 0.31065143 -0.17687237 -0.08368332 -0.08389187 -0.09881344
2 -0.3981141 0.45578358 -0.40960492 0.53311507 1.81785638 -1.08784127
3 -0.3314624 0.43062774 -0.31765082 0.20067007 0.83132143 -0.55740904
4 -0.3588452 0.42873072 -0.36110010 0.40713285 1.35698554 -0.82340123
5 -0.2764009 0.38868612 -0.25353852 0.02632578 0.34875186 -0.33062038
6 1.1604954 -1.34363273 0.72389507 -0.20529827 -1.11835632 1.17173776
Office Construction Production Unemployment IncomePerCap
1 0.06138421 0.1289094 0.1064936 -0.23047133 25340.47
2 -0.30249732 -0.9092239 -1.1486967 -0.66205029 50788.01
3 0.03930789 -0.3810941 -0.6273035 -0.52118182 35892.73
4 -0.10289042 -0.6828886 -0.9379571 -0.61013558 42677.39
5 0.09089521 -0.1005031 -0.2648395 -0.40948506 30223.86
6 -0.08558404 0.3344531 0.5598170 1.46755772 10698.00
[ reached getOption("max.print") -- omitted 2 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 8 7 1 1 5 7 5 1 1 1 7 7 8 1 5 7 8 3 3 1 1 5 7 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
8 7 5 5 4 5 3 7 3 4 1 5 5 1 7 7 1 7 6 8 8 8 8 7 7 7
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
8 7 7 8 7 7 7 8 7 8 7 1 7 8 8 7 7 1 7 7 6 7 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 23806194051 22351110068 24674184543 22763698227 24425126210 32229516603
[7] 22734152115 22364091140
(between_SS / total_SS = 97.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 7 3 3 7 9 1 3 1 7 7 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:9, 1:11] -0.294 -0.347 0.049 -0.402 1.22 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:9] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:9] 1.54e+10 1.52e+10 1.73e+10 1.58e+10 2.42e+10 ...
$ tot.withinss: num 1.55e+11
$ betweenss : num 7.16e+12
$ size : int [1:9] 8025 5905 11753 2719 4994 4354 12230 9140 10447
$ iter : int 3
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
K-means clustering with 9 clusters of sizes 8025, 5905, 11753, 2719, 4994, 4354, 12230, 9140, 10447
Cluster means:
Hispanic White Black Asian Professional Service
1 -0.29398811 0.4084707 -0.2881384 0.09801206 0.5412472 -0.41887898
2 -0.34732990 0.4326622 -0.3266366 0.26353963 0.9956851 -0.63591035
3 0.04903299 -0.1030093 0.1312846 -0.13421926 -0.5427295 0.27794329
4 -0.40241766 0.4583190 -0.4144673 0.54771279 1.8491328 -1.10706894
5 1.21957721 -1.3726251 0.7042742 -0.21976591 -1.1201758 1.19133726
6 -0.35966388 0.4293714 -0.3692219 0.42709539 1.4300252 -0.86183629
Office Construction Production Unemployment IncomePerCap
1 0.092171984 -0.214882278 -0.42684953 -0.4601316 32552.93
2 -0.003486057 -0.478374937 -0.72844821 -0.5525047 37694.40
3 0.017290823 0.273752980 0.44804925 0.1820710 19710.62
4 -0.310870838 -0.926207144 -1.16439186 -0.6642217 51363.70
5 -0.083922955 0.335218762 0.54026333 1.5507180 10154.85
6 -0.136970585 -0.719826670 -0.97229327 -0.6191718 43867.61
[ reached getOption("max.print") -- omitted 3 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
7 3 3 7 9 1 3 1 7 7 7 3 3 3 7 9 7 8 2 2 9 9 9 3 7 7
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
3 3 1 9 2 1 2 3 2 6 7 1 1 7 3 3 9 3 5 8 8 8 8 3 3 3
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
3 3 7 8 3 3 3 8 3 3 3 9 3 8 8 3 3 7 3 3 5 3 7
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 15416927509 15194151787 17298838728 15761413559 24236871148 16554689443
[7] 17160968891 17348442911 16348725566
(between_SS / total_SS = 97.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
List of 9
$ cluster : Named int [1:69567] 9 2 8 9 5 5 8 10 9 9 ...
..- attr(*, "names")= chr [1:69567] "1" "2" "3" "4" ...
$ centers : num [1:10, 1:11] 1.293 0.172 -0.361 0.733 -0.269 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:10] "1" "2" "3" "4" ...
.. ..$ : chr [1:11] "Hispanic" "White" "Black" "Asian" ...
$ totss : num 7.31e+12
$ withinss : num [1:10] 1.67e+10 1.25e+10 1.39e+10 1.17e+10 1.19e+10 ...
$ tot.withinss: num 1.28e+11
$ betweenss : num 7.18e+12
$ size : int [1:10] 3740 9868 3958 7455 8799 2488 5279 10782 10229 6969
$ iter : int 2
$ ifault : int 4
- attr(*, "class")= chr "kmeans"
K-means clustering with 10 clusters of sizes 3740, 9868, 3958, 7455, 8799, 2488, 5279, 10782, 10229, 6969
Cluster means:
Hispanic White Black Asian Professional Service
1 1.29347073 -1.3926502 0.65110535 -0.227144565 -1.10439711 1.2206491
2 0.17193979 -0.2995527 0.27181346 -0.131665132 -0.65475827 0.4157717
3 -0.36067127 0.4329611 -0.37594612 0.433931771 1.47068572 -0.8852975
4 0.73251794 -1.0447516 0.74170741 -0.163014767 -1.02683185 0.9384102
5 -0.26944247 0.3834500 -0.24218980 -0.002215569 0.27358175 -0.2989392
6 -0.40143948 0.4574908 -0.41624907 0.552865396 1.86667764 -1.1156001
Office Construction Production Unemployment IncomePerCap
1 -0.076264666 0.32954885 0.47907560 1.65962676 9446.23
2 -0.005854058 0.28803780 0.50879858 0.34460804 18266.13
3 -0.150092261 -0.74228280 -0.99238368 -0.63110451 44529.44
4 -0.088022816 0.32539421 0.65367907 0.92950639 14162.91
5 0.094618993 -0.05725246 -0.20068842 -0.38834330 29357.81
6 -0.323599903 -0.93488726 -1.16996086 -0.66198939 51688.15
[ reached getOption("max.print") -- omitted 4 rows ]
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
9 2 8 9 5 5 8 10 9 9 8 8 2 2 9 5 8 4 10 7 9 9 5 2 8 8
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 51 52 53
2 2 5 5 7 10 10 8 10 3 8 5 10 8 2 2 9 2 4 2 4 2 4 2 8 2
54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
2 2 8 4 2 8 8 2 2 2 8 9 2 4 2 8 2 9 2 8 4 2 8
[ reached getOption("max.print") -- omitted 69492 entries ]
Within cluster sum of squares by cluster:
[1] 16668900913 12479477177 13910959117 11660948482 11895068060 12674139695
[7] 12952720882 11959674961 11678714620 12133241415
(between_SS / total_SS = 98.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"

K-means is an unsupervised learning algorithm whose goal is to find groups, or clusters, in the data in order to identify patterns. All values in the dataset were standardized so that variables could be compared on a similar scale. K-means was run for k = 2 through 10 clusters.
On inspection of the clusters for k = 2, the cluster with the higher IncomePerCap, at 37598, also had the highest cluster means for Professional (0.928), White (0.420), and Asian (0.237). The cluster plot shows all ~70,000 data points, with the two clusters drawn in blue and red respectively. The clusters appear to overlap, but this is an artifact of projecting all of the data points onto a two-dimensional plot. With only two clusters, about 65.8% of the total sum of squares is captured between clusters.
For k = 3, the cluster with the highest IncomePerCap was cluster three, at 42760; it also had the highest cluster means for Professional (1.330) and Asian (0.384). The first cluster, with an IncomePerCap mean of 16577, had the highest Unemployment mean at 0.619. The cluster plot shows three distinct clusters, though the overlap makes it somewhat difficult to tell which cluster is which. With three clusters, 83.3% of the sum of squares is captured, a drastic improvement over two.
A final detailed look was taken at k = 4. The cluster with the highest IncomePerCap was cluster three, at 45781, which also had the highest means for Professional (1.533) and Asian (0.456). The cluster with the lowest IncomePerCap was cluster two, at 14434, which had the highest Unemployment mean at 0.899. The cluster plot is difficult to interpret, since all of the data points are reduced to two dimensions and there are now four clusters. With four clusters, 90.2% of the sum of squares is captured. As k increased from 5 to 10, the percentage captured did not rise drastically; at k = 10, for example, it reaches 98.2%. A choice of four clusters would therefore be sufficient, as it captures most of the structure in the data.
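The selection of k above can be sketched as follows. This is an illustrative Python/scikit-learn version on synthetic three-cluster data, not the R `kmeans` runs printed above; `between_SS / total_SS` is recovered from scikit-learn's `inertia_`, which corresponds to R's `tot.withinss`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups standing in for the tracts.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(200, 3)) for c in (0, 4, 8)])
Z = StandardScaler().fit_transform(X)  # standardize, as in the analysis above

# total_SS of the standardized data; inertia_ plays the role of tot.withinss.
total_ss = ((Z - Z.mean(axis=0)) ** 2).sum()
ratios = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    ratios[k] = 1 - km.inertia_ / total_ss  # between_SS / total_SS

# The ratio jumps sharply while real structure remains, then flattens out,
# which is the diminishing-returns pattern used to settle on a modest k.
print(ratios[3] > ratios[2])
```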
Ridge Regression
[1] 15 100

The dataset had ten predictor variables explaining the dependent variable, IncomePerCap. Ridge regression was introduced because it minimizes the residual sum of squares plus a shrinkage penalty of lambda multiplied by the sum of the squared coefficients. As lambda increases, the coefficients approach zero; the plot shows the full path of the coefficients as they shrink toward zero. To build the ridge regression, a log-scale grid of lambda values was constructed from 10^10 down to 10^-2 in 100 steps.
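The coefficient path over that grid can be sketched as follows. This is an illustrative Python/scikit-learn version on synthetic data (the analysis above was done in R, presumably with `glmnet`); scikit-learn's `alpha` plays the role of lambda:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in for the ten census predictors.
rng = np.random.default_rng(3)
n, p = 500, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# The same log-scale grid described above: 1e10 down to 1e-2 in 100 steps.
grid = np.logspace(10, -2, 100)
paths = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in grid])

# At the largest lambda every coefficient is shrunk essentially to zero;
# at the smallest lambda the fit approaches ordinary least squares.
print(np.abs(paths[0]).max() < 1e-3)
print(np.abs(paths[-1]).max() > 0.1)
```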
Train and Test Sets
To avoid introducing bias when developing the ridge and lasso regressions, the data were split into train and test sets, with a random 50% of the observations assigned to the training set.
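A minimal sketch of that 50/50 split (in Python rather than the original R; only the row count of 69,567 comes from the dataset above): shuffle the row indices once and halve them, so the two sets are disjoint random subsets.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 69567  # number of rows in the dataset
idx = rng.permutation(n)  # one random shuffle of the row indices
train_idx, test_idx = idx[: n // 2], idx[n // 2:]

print(len(train_idx) + len(test_idx) == n)          # every row is used once
print(len(set(train_idx) & set(test_idx)) == 0)     # the sets are disjoint
```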

[1] 824.8974
lowest lambda from CV: 824.8974
MSE for best Ridge lambda: 12154666
All the coefficients :
(Intercept) Hispanic White Black Asian Professional
19143.7093 -300.1618 436.8876 -34.8267 267.4880 1313.3468
Service Office Construction Production Unemployment
-988.9918 -358.9968 -465.5783 -707.8713 -806.0590
R^2:
[1] 0.8843423
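The lambda, test MSE, and R^2 reported here come from a cross-validated search over the lambda grid. A comparable sketch in Python/scikit-learn (synthetic data; `RidgeCV` stands in for the R cross-validation, and `alpha` for lambda):

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 50/50 as in the analysis above.
rng = np.random.default_rng(5)
n, p = 1000, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=1.0, size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Pick lambda from the log-scale grid by CV on the training half only,
# then refit and score the chosen model on the held-out half.
grid = np.logspace(10, -2, 100)
cv = RidgeCV(alphas=grid, cv=5).fit(X_tr, y_tr)
best = Ridge(alpha=cv.alpha_).fit(X_tr, y_tr)

mse = mean_squared_error(y_te, best.predict(X_te))
r2 = r2_score(y_te, best.predict(X_te))
print(cv.alpha_ in grid)  # the chosen lambda is one of the 100 grid values
```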
To select the best ridge model, cross-validation was implemented. The cross-validation curve indicated that a model with all ten predictors yields the lambda with the lowest mean squared error: as lambda decreases, the mean squared error also decreases. Ridge regression retains all of the predictors, and the best value of lambda is indicated by the first vertical line. The lowest lambda from cross-validation was found to be 825, and the MSE for the best ridge model was 30834392. From the fitted model, the most positive coefficients were Professional at 3248, White at 1104, and Asian at 546; the strongest negative coefficients were Service at -2195, Production at -2021, and Unemployment at -1602. It was interesting to note that only Professional among the occupation variables had a positive coefficient, while the others were all negative. The R^2 for the best ridge model was found to be 0.707, meaning the model explains 70.7% of the variation in income.
Lasso Regression


lowest lambda from CV: 25.78395
MSE for best Lasso lambda: 11672982
All the coefficients :
(Intercept) Hispanic White Black Asian Professional
17484.872000 -212.822777 219.903152 0.000000 165.286025 1987.274255
Service Office Construction Production Unemployment
-284.732831 8.180852 0.000000 -4.656966 -619.976123
The non-zero coefficients :
(Intercept) Hispanic White Asian Professional Service
17484.872000 -212.822777 219.903152 165.286025 1987.274255 -284.732831
Office Production Unemployment
8.180852 -4.656966 -619.976123
[1] 0.8889258
Lasso regression was also implemented to see whether it would perform differently from the OLS or ridge models. Lasso regression can be useful for reducing overfitting and assisting in model selection. From the coefficient path plot, the three most positive coefficients are Professional at 6030.2, White at 1690, and Asian at 712.2, meaning these variables exert a much stronger positive pull than the others. The three most negative coefficients are Unemployment at -1613, Service at -716.5, and Production at -622.3. Construction was found to have a coefficient of exactly 0.0, so it was removed from the final lasso model. It is also interesting to note that the coefficient for Hispanic is small, at 13.6, so it does not deviate much from the ordinary least squares (OLS) solution.
Cross-validation was used to select the lambda value with the lowest MSE. The CV recommended that eight predictors be used to predict income, with the lasso removing Construction from the equation. The cross-validated lambda was found to be 16.2, the MSE for the best lasso model was 30709528, and the R^2 was 0.708, meaning the model explains 70.8% of the variation in income.
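The variable-selection behavior described above can be sketched as follows. This is an illustrative Python/scikit-learn version on synthetic data, not the R `glmnet` fit: when only a few predictors carry signal, a cross-validated lasso keeps them and shrinks the rest toward, and often exactly to, zero, just as Construction was dropped here.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Ten synthetic predictors, of which only the first three truly matter.
rng = np.random.default_rng(6)
n, p = 1000, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [5.0, -3.0, 2.0]
y = X @ beta + rng.normal(size=n)

# Lambda (alpha) is chosen by 5-fold cross-validation, as in the analysis.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# The three largest coefficients in magnitude are the three true predictors.
top3 = sorted(int(i) for i in np.argsort(np.abs(lasso.coef_))[-3:])
print(top3)  # -> [0, 1, 2]
```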
Call:
lm(formula = IncomePerCap ~ ., data = datJLClean)
Residuals:
Min 1Q Median 3Q Max
-27748.0 -1894.5 -20.2 1770.9 18988.4
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17391.74 36.71 473.707 < 2e-16 ***
Hispanic 104.54 54.22 1.928 0.0539 .
White 669.66 72.93 9.183 < 2e-16 ***
Black 334.20 52.11 6.414 1.43e-10 ***
Asian 326.28 25.82 12.635 < 2e-16 ***
Professional 1077.33 2706.77 0.398 0.6906
Service -814.52 1609.32 -0.506 0.6128
Office -358.87 1173.52 -0.306 0.7598
Construction -378.99 1193.57 -0.318 0.7508
Production -515.94 1505.74 -0.343 0.7319
Unemployment -616.00 17.07 -36.087 < 2e-16 ***
ipc2High 19892.01 62.25 319.559 < 2e-16 ***
ipc4Mid-Low 5181.26 42.82 120.998 < 2e-16 ***
ipc4Mid-High -9860.07 41.82 -235.757 < 2e-16 ***
ipc4High NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3412 on 69553 degrees of freedom
Multiple R-squared: 0.8893, Adjusted R-squared: 0.8892
F-statistic: 4.296e+04 on 13 and 69553 DF, p-value: < 2.2e-16
Call:
lm(formula = IncomePerCap ~ Hispanic + White + Black + Asian +
Professional + Service + Office + Production + Unemployment,
data = datJLClean)
Residuals:
Min 1Q Median 3Q Max
-57889 -3155 -139 3092 39315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26167.82 20.93 1250.46 <2e-16 ***
Hispanic 972.98 87.56 11.11 <2e-16 ***
White 2983.19 117.35 25.42 <2e-16 ***
Black 1175.89 84.20 13.97 <2e-16 ***
Asian 1115.17 41.58 26.82 <2e-16 ***
Professional 5991.86 53.93 111.11 <2e-16 ***
Service -737.90 39.77 -18.55 <2e-16 ***
Office 359.14 28.63 12.54 <2e-16 ***
Production -667.56 40.62 -16.44 <2e-16 ***
Unemployment -1604.14 26.65 -60.19 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5519 on 69557 degrees of freedom
Multiple R-squared: 0.7102, Adjusted R-squared: 0.7101
F-statistic: 1.894e+04 on 9 and 69557 DF, p-value: < 2.2e-16
MSE for full model :
[1] 11638291
MSE for full model (w/o construction) :
[1] 30460435
An OLS model was constructed for both the full model and the full model without the Construction variable, to compare against the ridge and lasso models. The R^2 for both OLS models was found to be 0.71, meaning each explains about 71% of the variation in income. The MSE for the full model was found to be 30459848, while the full model without the Construction variable had a larger MSE at 30460435. Overall the lasso, ridge, and both OLS models explain roughly the same amount of variability in the data, with R^2 values all around 0.70. Since the full OLS model has the lowest MSE and the highest R^2, it would be a more suitable option than the ridge, lasso, or OLS-without-Construction models.
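The direction of that last comparison is expected for nested least-squares fits: dropping a predictor can never lower the training MSE of an OLS model. A minimal sketch with synthetic data (NumPy least squares, not the census models above):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 800, 6
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=1.0, size=n)

def ols_mse(design, y):
    """Training MSE of an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), design])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid / len(y))

mse_full = ols_mse(X, y)
mse_reduced = ols_mse(X[:, :-1], y)  # drop one predictor, as with Construction

print(mse_full <= mse_reduced)  # the full model's training MSE is never larger
```

Whether the full model also wins on held-out data is a separate question, which is why the train/test split earlier in the section matters.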